Abstract. Static analyzers based on abstract interpretation are complex
pieces of software implementing delicate algorithms. Even if static analysis
techniques are well understood, their implementation on real languages
is still error-prone.

This paper presents a formal verification using the Coq proof assistant: a
formalization of a value analysis (based on abstract interpretation), and a
soundness proof of the value analysis. The formalization relies on generic
interfaces. The mechanized proof is facilitated by a translation validation
of a Bourdoncle fixpoint iterator.

The work has been integrated into the CompCert verified C-compiler. Our
verified analysis directly operates over an intermediate language of the
compiler having the same expressiveness as C. The automatic extraction
of our value analysis into OCaml yields a program with competitive
results, obtained from experiments on a number of benchmarks and
comparisons with the Frama-C tool.



1 Introduction



Over the last decade, significant progress has been made in developing tools to
support mathematical and program-analytic reasoning. Proof assistants like ACL2,
Coq, HOL, Isabelle and PVS are now successfully applied both in mathematics
(e.g., a mechanized proof of the 4-colour theorem [ ] and of the Feit-Thompson
theorem [ ]) and in formal verification of critical software systems (e.g., the
CompCert C-compiler [ ] and the verified operating system kernel seL4 [16]).
Over the same time, automatic verification tools based on model-checking,
static analysis and program proof have become widely used by the critical software
industry. The main reason for their success is that they strengthen the confidence
we can have in critical software by providing evidence of software correctness. The
next step is to strengthen the confidence in the results of these verification tools,
and proof assistants seem to be mature and adequate for this task. This paper
presents a foundational step towards the formal verification of a static analysis
based on abstract interpretation [10]: the formal verification using the Coq proof
assistant of a value-range analysis operating over a real-world language.

Static analyzers based on abstract interpretation are complex pieces of software
that implement delicate symbolic algorithms and numerical computations. Their 
design requires a deep understanding of the targeted programming language. 
Misinterpretations of the programming language informal semantics may lead to 
subtle soundness bugs that may be hard to detect by using only testing techniques. 
Implementing a value analysis raises specific issues related to low-level numeric 
computations. First, the analysis must handle the machine arithmetic that is 
(more or less) defined in the programming language. Second, some computations 
done by the analyzer rely on this machine arithmetic. 

Thus, a prerequisite for implementing a static analyzer operating over a C-like
language is to rely on a formal semantics of the programming language defining
precisely the expected behaviors of any program execution (and including
low-level features such as machine arithmetic). Such formal semantics are defined
in the CompCert compiler, which is unusual for a compiler. More precisely, each
language of the compiler is defined by a formal semantics (in Coq) associating 
observable behaviors to any program. Observable behaviors include normal 
termination, divergence, abnormal termination and undefined behaviors (such as 
out-of-bounds array access). We have chosen one language of the compiler (the 
main intermediate language that has the same expressiveness as C, see Section 2) 
and we have formalized a static analyzer operating over this language. The 
advantage of this approach is that our analyzer as well as the formal semantics 
operate exactly over the same language. 

The main peculiarity of the CompCert C-compiler is that it is equipped with
a proof of semantic preservation [ ]. This proof is made possible thanks to the
formal semantics of the languages of the compiler. The proof states that any 
compiled program behaves exactly as specified by the semantics of its original 
program. It consists of the composition of correctness proofs for each compiler 
pass and thus involves reasoning on the different intermediate languages of the 
compiler. 

All results presented in this paper have been mechanically verified using 
the Coq proof assistant. The complete Coq development is available online at 
http://www.irisa.fr/celtique/ext/value-analysis. 

The paper makes the following contributions. 

— It provides the first verified value analysis for a realistic language such as C 
and hence demonstrates the usability of theorem proving in static analysis of 
real programs. 

— It presents a modular design with strong interfaces aimed at facilitating any 
further extension. 

— It provides a reference description of basic techniques of abstract inter- 
pretation and thus gives advice on how to use the abstract interpretation 
methodology for this kind of exercise while maintaining a sufficiently low
cost in terms of formal proof effort. 

— It compares the performances of our tool (that has been generated automat- 
ically from our formalization and integrated into the CompCert compiler) 
with those of two interval-based value analyzers for C. 



The paper presents many examples taken from the formal development. It
is structured to follow the development of a C value analysis based on abstract
interpretation: from generic abstract domains (Section 3), to fixpoint resolution
(Section 4) and numerical and memory abstractions (Sections 5 and 6). Section 7
describes the experimental evaluation of our implementation. Related work is
discussed in Section 8, followed by concluding remarks.

2 Background 

This section starts with a short introduction to the Coq proof assistant. It is 
followed by a brief presentation of the CompCert architecture and memory model. 
The language our analyzer operates over is described at the end of this section. 

2.1 Short Introduction to Coq 

Coq is an interactive theorem prover. It consists of a strongly typed specification
language and a language for conducting machine-checked proofs interactively.
The Coq specification language is a functional programming language as well
as a language for inductively defining mathematical properties, for which it has
a dedicated type (Prop). Induction principles are automatically generated by
Coq from inductive definitions, which makes inductive reasoning very convenient.
Data structures may consist of properties together with dependent types. Coq's
type system includes type classes. Coq specifications are usually defined in a
modular way (e.g., using record types and functors, which are functions operating
over structured data such as records). The user is in charge of building proofs
interactively in the system, but those proofs are automatically machine-checked
by the Coq kernel. OCaml programs can be automatically generated by Coq from
Coq specifications. This process is called extraction.

2.2 The CompCert Memory Model 

There are 11 languages in the CompCert compiler, including 9 intermediate
languages. These languages feature both low-level aspects such as pointers,
pointer arithmetic and nested objects, and high-level aspects such as separation
and freshness guarantees. A memory model [21] is shared by the semantics of
all these languages. Memory states (of type mem) are collections of blocks, each
block being an array of abstract bytes. A block represents a C variable or an
invocation of malloc. Pointers are represented by pairs (b,i) of a block identifier
b and a byte offset i within this block. Pointer arithmetic modifies the offset part
of a pointer value, keeping its block identifier part unchanged.

Values stored in memory are the disjoint union of 32-bit integers (written as 
vint(i)), 64-bit floating-point numbers, locations (written as vptr(b,i)), and 
the special value undef representing the contents of uninitialized memory. Pointer 
values vptr(b,i) are composed of a block identifier b and an integer byte offset i
within this block. Memory chunks appear in memory operations load and store, 
to describe concisely the size, type and signedness of the value being stored. 



Values:      v ::= vint(i) | vfloat(f) | vptr(b, i)
                 | undef
Mem. chunks: κ ::= Mint8signed | Mint8unsigned      8-bit integers
                 | Mint16signed | Mint16unsigned    16-bit integers
                 | Mint32                           32-bit integers or pointers
                 | Mfloat32                         32-bit floats
                 | Mfloat64                         64-bit floats



In CompCert, a 32-bit integer (type int) is defined as a Coq arbitrary-precision
integer (type Z) plus a property called intrange stating that it is in the range
0 to 2^32 (excluded). The function signed (resp. unsigned) gives an interpretation
of machine integers as a signed (resp. unsigned) integer. The properties
signed_range and unsigned_range are examples of useful properties for machine
integers.

Definition max_unsigned : Z := 2^32 - 1.
Definition max_signed : Z := 2^31 - 1.
Definition min_signed : Z := - 2^31.
Record int := { intval : Z;
                intrange : 0 ≤ intval < 2^32 }.
Definition unsigned (n: int) : Z := intval n.
Definition signed (n: int) : Z := if unsigned(n) < 2^31 then unsigned(n)
                                  else unsigned(n) - 2^32.
Theorem signed_range : ∀ i, min_signed ≤ signed(i) ≤ max_signed.
Proof. (* Proof commands omitted here *) Qed.
Theorem unsigned_range : ∀ i, 0 ≤ unsigned(i) ≤ max_unsigned.
Proof. (* Proof commands omitted here *) Qed.
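
As an informal illustration of the two interpretations, the following Python sketch (not part of the Coq development) models a machine integer as a plain integer in the range 0 to 2^32 (excluded). The helper make_int is a hypothetical stand-in for building an int together with its intrange proof; here it simply reduces its argument modulo 2^32.

```python
# Illustrative model only: a machine integer is a Python int in [0, 2^32).
MODULUS = 2**32
HALF = 2**31
MAX_UNSIGNED = MODULUS - 1
MAX_SIGNED = HALF - 1
MIN_SIGNED = -HALF


def make_int(z):
    """Hypothetical constructor: normalize any integer into [0, 2^32)."""
    return z % MODULUS


def unsigned(n):
    """Unsigned interpretation: the stored value itself."""
    assert 0 <= n < MODULUS
    return n


def signed(n):
    """Signed interpretation: values from 2^31 upwards denote negatives."""
    assert 0 <= n < MODULUS
    return n if n < HALF else n - MODULUS


# A few boundary values, used to spot-check the two range properties.
samples = [make_int(z) for z in (0, 1, -1, HALF - 1, HALF, MAX_UNSIGNED)]
for n in samples:
    assert MIN_SIGNED <= signed(n) <= MAX_SIGNED   # signed_range
    assert 0 <= unsigned(n) <= MAX_UNSIGNED        # unsigned_range
```

The same bit pattern therefore denotes two different mathematical integers, for example -1 and 2^32 - 1, which is why the analysis tracks one range per interpretation.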



2.3 The CFG Intermediate Language 

The main intermediate language of the CompCert compiler is called Cminor, a
low-level imperative language structured like C into expressions, statements and
functions. Historically, Cminor was the target language of the compiler front-end.
There are four main differences with C [ ]. First, arithmetic operators are not
overloaded. Second, address computations are explicit, as well as memory access
(using load and store operations). Third, control structures are if-statements,
infinite loops, nested blocks plus associated exits and early returns. Last, local
variables can only hold scalar values and they do not reside in memory, making it
impossible to take a pointer to a local variable like the C operator & does. Instead,
each Cminor function declares the size of a stack-allocated block, allocated
in memory at function entry and automatically freed at function return. The
expression addrstack(n) returns a pointer within that block at constant offset n.

Cminor was designed to be the privileged language for integrating within
CompCert other tools operating over C and other compiler front-ends. For
instance, two front-end compilers from functional languages to Cminor have been
connected to CompCert using Cminor, and a separation logic has been defined for
Cminor [ ]. The Concurrent Cminor language extends Cminor with concurrent
features and lies at the heart of the Verified Software Toolchain project [ ].

As control flow is still complex in Cminor (due to the presence of nested
blocks and exits), we have first designed a new intermediate language called CFG
that is adapted for static analysis: 1) its expressions are Cminor expressions (i.e.,
side-effect free C expressions), 2) its programs are represented by their control
flow graphs, with explicit program points and 3) the control flow is restricted
to simple unconditional and conditional jumps. The CFG syntax is defined in
Figure 1. Floating-point operators are omitted in the figure, as our analysis
does not compute any information about floats. Statements include assignment
to local variables, memory stores, if-statements and function calls. Expressions
include reading local variables, constants and arithmetic operations, reading store
locations, and conditional expressions. As in the memory model, loads and stores
are parameterized by a memory chunk κ.

The CFG language is integrated into the CompCert compiler, as shown in 
Figure 2. There is a translation from Cminor to CFG and a theorem stating that 
any terminating or diverging execution of a CFG program is also a terminating or 
diverging execution of the original Cminor program. Thus, instead of analyzing 
Cminor programs, we can analyze CFG programs and use this theorem to
propagate the results of the CFG analysis to Cminor programs. For instance, in
order to show that a Cminor program is memory safe, we only need to show that
its CFG translation is memory safe.

For the purpose of the experiments that we conduct in Section 7, we use an 
inlining pass recently added to the CompCert compiler. It was implemented and 
proved correct by X. Leroy for another language of the compiler, RTL, that is
similar to CFG except that it only handles flat expressions. Since our analysis 
operates on CFG, we have adapted this inlining pass to CFG. Adapting the 
soundness proof of this transformation to CFG has been left for future work. 

The concrete semantics of CFG is defined in small-step style as a transition
relation between execution states. An execution state is a tuple called σ. Among
the components of σ are the current program point (i.e., a node in the control-flow
graph), the memory state (type mem) and the environment (type env) mapping
program variables to values. We use σ.E to denote the environment of a state σ,
and dom(σ.E) to denote its domain (i.e., the set of its variables). We use reach(P)
to denote the set of states belonging to the execution trace of P.

Our value analysis (called value_analysis) computes for each program point
the estimated values of the program variables. When the value of a variable is
an integer i or a pointer value of offset i, the estimate provides two numerical
ranges signed_range and unsigned_range. The first one over-approximates the
signed interpretation of i and the other over-approximates its unsigned
interpretation. We write ints_in_range (signed_range, unsigned_range) i to denote
this fact. Thus, given a program P, value_analysis(P) yields a map such that for
each node l in its control flow graph and each variable v, value_analysis(P)[l, v]
is a pair of sound ranges for v. The following theorem states the soundness of the
value analysis: for every program state that may be reached during the execution



Constants:    c ::= n | f                        integer and floating-point constants
                  | addrsymbol(id, n)            address of a symbol plus an offset
                  | addrstack(n)                 stack pointer plus a given offset
Expressions:  a ::= id                           variable identifier
                  | c                            constant
                  | op1 a                        unary arithmetic operation
                  | a1 op2 a2                    binary arithmetic operation
                  | a1 ? a2 : a3                 conditional expression
                  | load(κ, a)                   memory load
Unary op.:  op1 ::= cast8unsigned                8-bit zero extension
                  | cast8signed                  8-bit sign extension
                  | cast16unsigned               16-bit zero extension
                  | cast16signed                 16-bit sign extension
                  | boolval                      0 if null, 1 if non-null
                  | negint                       integer opposite
                  | notbool                      boolean negation
                  | notint                       bitwise complement
Binary op.: op2 ::= + | - | * | / | %            arithmetic integer operators
                  | << | >> | & | | | ^          bitwise operators
                  | /u | %u | >>u                unsigned operators
                  | cmp(b)                       integer signed comparisons
                  | cmpu(b)                      integer unsigned comparisons
Comparisons:  b ::= < | <= | > | >= | == | !=    relational operators
Statements:   i ::= skip(l)                      no operation (go to l)
                  | assign(id, a, l)             assignment
                  | store(κ, a, a, l)            memory store
                  | if(a, l_true, l_false)       if statement
                  | call(sig, id?, a, a*, l)     function call
                  | return(a?)                   function return

Fig. 1. Abstract syntax of CFG




Fig. 2. Integration of the value analysis in the CompCert toolchain 



of a program, any program point and variable, every variable valuation computed 
by the analysis is a correct estimation of the exact value given by the concrete 
semantics. 

Theorem 1 (Soundness of the value analysis). Let P be a program, σ ∈
reach(P) and res = value_analysis(P) be the result of the value analysis. Then,
for each program point l, for each local variable v ∈ dom(σ.E) that contains
an integer i (i.e., σ.E(v) = vint(i) ∨ ∃b, σ.E(v) = vptr(b, i)), the property
(ints_in_range res[l,v] i) holds.
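
To make the property in Theorem 1 concrete, here is a small Python sketch of what ints_in_range requires of a pair of ranges. It is only an illustration under simplifying assumptions: ranges are (lo, hi) pairs of mathematical integers, None stands for an empty range, and machine integers are plain integers in the range 0 to 2^32 (excluded). None of these encodings come from the Coq development.

```python
MODULUS = 2**32


def signed(n):
    """Signed interpretation of a machine integer stored in [0, 2^32)."""
    return n if n < 2**31 else n - MODULUS


def ints_in_range(signed_range, unsigned_range, n):
    """True when both interpretations of n fall in their respective range.

    Each range is a (lo, hi) pair; None encodes an empty range (assumption).
    """
    if signed_range is None or unsigned_range is None:
        return False
    s_lo, s_hi = signed_range
    u_lo, u_hi = unsigned_range
    return s_lo <= signed(n) <= s_hi and u_lo <= n <= u_hi


# Suppose a variable may hold -1 or 3 at some program point.
reachable = [(-1) % MODULUS, 3]

# A sound estimate covers every reachable value under both interpretations.
sound = ((-1, 3), (0, MODULUS - 1))
# This one forgets that the variable can be negative, so it is unsound.
unsound = ((0, 3), (0, MODULUS - 1))

print(all(ints_in_range(*sound, n) for n in reachable))
print(all(ints_in_range(*unsound, n) for n in reachable))
```

Soundness in the sense of the theorem means that the first situation holds for every reachable state, program point and integer-valued variable.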

2.4 Overview of a Modular Value Analysis 

Our value analysis is designed in a modular way: a generic fixpoint iterator
operates over generic abstract domains (see Section 3). The iterator is based on
the state-of-the-art algorithm of Bourdoncle [ ], which provides both efficiency
and precision (see Section 4).

The modular design of the abstract domains is inspired by the design of
the Astrée analyzer. It consists of three layers, shown in Figure 3. The
simplest domains are numerical abstract domains made of intervals of machine
integers. These domains are not aware of the C memory model.



[Figure 3 not reproduced. Recoverable labels: Signed Intervals, Unsigned
Intervals, Congruence Abstract Domain.]

Fig. 3. Design of abstract domains: a three-layer view



In a C program, the same piece of data can be used both in signed and unsigned
operations, and the results of these operations differ from one interpretation to the
other. Thus, we have two numerical abstract domains, one for each interpretation.
Our analysis computes the reduced product of the two domains so that they
continuously exchange information and refine each other (see Section 5).

Then, we build abstract domains representing numerical environments. We
provide a non-relational abstraction that is parameterized by a numerical
abstract domain. The last layer is the abstract domain representing memory. It is
parameterized by the previous layer and links the abstract interpreter with the
numerical abstract domains (see Section 6).

This modular design is intended to allow other abstract domains to be connected
at each layer. They are represented by dotted lines in Figure 3. For example, several
abstract memory models can be used instead of the current one while maintaining
the same interfaces with the rest of the formal development. The ultimate goal is to
enhance our current abstract interpreter in order to connect it to a memory domain
à la Miné [ ]. The current interfaces are also compatible with any relational
numerical abstract domain. At the top, more basic numerical abstractions such as
congruences could be added and plugged into our reduced product.



3 Abstract Domain Library 

This section describes the library we have designed to represent our abstract 
domains. First, it defines generic abstract domains. Then, it details the interval 
abstract domain. Last, it explains how to combine abstract domains. 

3.1 Abstract Domain Interface 

Abstract interpretation provides various frameworks [ ] for the design of abstract
semantics. The most well-known framework is based on Galois connections but
some relaxed frameworks exist. They are generally used when some useful
abstraction does not fulfill standard properties (e.g., polyhedral abstract domains [ ]
do not form a complete lattice). In our context, a relaxed framework is required
because it keeps the proof effort lightweight.

Since our main goal is to provide a formal proof of soundness for the result of an
analysis, some additional properties such as best approximation or completeness
do not require a machine-checked proof. In some previous work of the last author,
a framework has been defined for the purpose of machine-checked proofs [ ].
In this paper, we push this initiative further and provide a more minimalist
framework. The signature of abstract domains is of the following form.³

Notation ℘(A) := (A → Prop).          (* identify sets and predicates *)
Notation x ∈ P := (P x).
Notation P1 ⊆ P2 := (incl P1 P2).     (* property inclusion *)
Record adom (A:Type) (B:Type) : Type := {
  le: A → A → bool;                   (* partial order test *)
  top: A;                             (* greatest element *)
  join: A → A → A;                    (* least upper bound *)
  widen: A → A → A;                   (* widening operator *)
  gamma: A → ℘(B);                    (* concretization function *)
  gamma_monotone:                     (* monotonicity of gamma *)
    ∀ a1 a2, le a1 a2 = true → (gamma a1) ⊆ (gamma a2);
  top_sound:                          (* top over-approximates any *)
    ∀ x, x ∈ (gamma top);             (* concrete property *)
  join_sound: ∀ x y:A,                (* join over-approximates *)
    (gamma x) ∪ (gamma y) ⊆ gamma (join x y);   (* concrete union *)
}.

³ In this paper, for the sake of simplicity, we only use records to structure our
formalization. However, in our development, we also use more advanced Coq
features such as type classes.

Here, A is the type of abstract values, B is the type of concrete values, and 
the type of the abstract domain is (adom A B). This type is a record with 
various operators (described on the right part) and properties about them. This 
record contains only three properties: the monotonicity of the gamma operator, 
the soundness of the top element and the soundness of the least upper bound 
operator join. We do not provide formal proof relating the abstract order with 
top or join. Indeed any weak-join will be suitable here. The lack of properties 
about the widening operator is particularly surprising at first sight. In fact, as 
we will explain in Section 4, the widening operator is used only during fixpoint 
iteration and this step is validated a posteriori. Thus, only the result of this 
iteration step is verified and we don't need a widening operator for that purpose. 

The gamma operator of every abstract domain will be written γ. The type class
mechanism enables Coq to infer which domain it refers to.



3.2 Example of Abstract Domain: Intervals 

Our value analysis operates over compositions of abstract domains. The most 
basic abstract domain is the domain of intervals. Figure 4 defines the abstract 
domain of intervals made of machine integers, that are interpreted as signed 
integers. This instance is called signed_itv_adom. The definitions are standard 
and only some of them are detailed in the figure. An interval represents the 
range of the signed interpretation of a machine integer. Thus, top is defined as 
the largest interval with bounds min_signed and max_signed. The concretization 
is defined as follows. A machine integer n belongs to the concretization of an 
interval itv iff signed(n) belongs to itv. The proof of the lemma top_sound
follows from the signed_range theorem given in Section 2.2. 



Record itv := {min: Z; max: Z}.
Definition signed_itv_adom : adom itv int := {
  le := (λ itv1 itv2, ...);               (* definition omitted here *)
  top := {min := min_signed; max := max_signed};
  join := (λ itv1 itv2, ...);             (* definition omitted here *)
  widen := (λ itv1 itv2, ...);            (* definition omitted here *)
  gamma := (λ itv n, itv.min ≤ signed(n) ≤ itv.max);
  top_sound := (...);                     (* proof term omitted here *)
  gamma_monotone := (...);                (* proof term omitted here *)
  join_sound := (...);                    (* proof term omitted here *)
}.

Fig. 4. An instance called signed_itv_adom: the domain of intervals (made of
signed machine integers) with a concretization to ℘(int).
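
The operators whose definitions are omitted in Figure 4 can be pictured with the following Python sketch. It is an assumption-laden illustration rather than the Coq code: le, join and widen are the usual textbook interval operators, the record fields are named lo and hi instead of min and max, and gamma is a membership test on machine integers encoded as plain integers in the range 0 to 2^32 (excluded).

```python
from dataclasses import dataclass

MIN_SIGNED, MAX_SIGNED = -2**31, 2**31 - 1


def signed(n):
    """Signed interpretation of a machine integer stored in [0, 2^32)."""
    return n if n < 2**31 else n - 2**32


@dataclass(frozen=True)
class Itv:
    lo: int
    hi: int


TOP = Itv(MIN_SIGNED, MAX_SIGNED)


def le(a, b):
    """Order test: every value allowed by a is allowed by b."""
    return b.lo <= a.lo and a.hi <= b.hi


def join(a, b):
    """Smallest interval containing both arguments."""
    return Itv(min(a.lo, b.lo), max(a.hi, b.hi))


def widen(a, b):
    """Textbook widening: a bound that is not stable jumps to its extreme."""
    lo = a.lo if a.lo <= b.lo else MIN_SIGNED
    hi = a.hi if a.hi >= b.hi else MAX_SIGNED
    return Itv(lo, hi)


def gamma(itv, n):
    """Concretization as a predicate: does machine integer n belong to itv?"""
    return itv.lo <= signed(n) <= itv.hi


a, b = Itv(0, 5), Itv(3, 10)
print(join(a, b))            # Itv(lo=0, hi=10)
print(widen(a, join(a, b)))  # upper bound jumps to MAX_SIGNED
```

On sample values one can check the three properties required by the interface: gamma(TOP, n) holds for any machine integer, le implies inclusion of concretizations, and anything in gamma(a) or gamma(b) is also in gamma(join(a, b)).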



We also define a variant of this domain with a concretization using an unsigned
interpretation of machine integers: (λ itv n, itv.min ≤ unsigned(n) ≤ itv.max).
As explained in Section 5, combining both domains recovers some precision that
may be lost when using only one of them.

The itv record type provides only lower and upper bounds of type Z. Using
the expressiveness of the Coq type system, we could choose to add an extra field
requiring a proof that min ≤ max holds. While elegant at first sight, this would
be rather heavyweight in practice, since we must provide such a proof each time
we build a new interval. For the kind of proofs we perform, if such a property
was required, we would generally have a hypothesis of the form i ∈ (γ itv) in
our context and it would trivially imply that itv.min ≤ itv.max holds.

3.3 Abstract Domain Functors 

Our library provides several functors that build complex abstract domains from 
simpler ones. 

Direct Product. A first example is the product (adom (A*A') B) of two
abstract domains (adom A B) and (adom A' B), where the concretization of a pair
(a,a'):A*A' is the intersection (γ a) ∩ (γ a').

Lifting a Bottom Element. A bottom element is not mandatory in our definition
of abstract domains because some sub-domains do not necessarily contain one.
For instance, the domain of intervals does not contain such a least element. Still,
in our development, the bottom element plays a specific and important role since
we use it for reduction. We hence introduce a polymorphic type A+⊥ that lifts a
type A with an extra bottom element called Bot. We then define a simple functor
lift_bot that lifts any domain (adom A B) on a type A to a domain on A+⊥. In
this new domain, the concretization function extends the concretization of the
input domain and γ Bot = ∅.

Definition botlift (A:Type) : Type := Bot | NotBot (x:A).
Notation A+⊥ := (botlift A).

Definition lift_bot (A B:Type) : adom A B → adom (A+⊥) B :=
  (* definition omitted here *)
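
Since the definition of lift_bot is omitted, the following Python sketch shows one plausible way such a functor can behave. Everything in it is an assumption made for illustration: domains are dictionaries of operators, BOT is a sentinel object, NotBot x is represented by x itself, and the widening operator is left out for brevity.

```python
class _Bot:
    """Extra least element; its concretization is the empty set."""

    def __repr__(self):
        return "Bot"


BOT = _Bot()


def lift_bot(dom):
    """Return a domain over dom's abstract values extended with BOT."""

    def le(a, b):
        if a is BOT:
            return True       # Bot is below everything
        if b is BOT:
            return False      # nothing else is below Bot
        return dom["le"](a, b)

    def join(a, b):
        if a is BOT:
            return b          # Bot is neutral for join
        if b is BOT:
            return a
        return dom["join"](a, b)

    def gamma(a, x):
        # gamma Bot is empty; otherwise defer to the input domain.
        return a is not BOT and dom["gamma"](a, x)

    return {"le": le, "top": dom["top"], "join": join, "gamma": gamma}


# Toy input domain: intervals over mathematical integers as (lo, hi) pairs.
itv = {
    "le": lambda a, b: b[0] <= a[0] and a[1] <= b[1],
    "top": (-2**31, 2**31 - 1),
    "join": lambda a, b: (min(a[0], b[0]), max(a[1], b[1])),
    "gamma": lambda a, x: a[0] <= x <= a[1],
}
lifted = lift_bot(itv)
print(lifted["join"](BOT, (1, 2)))   # (1, 2)
print(lifted["gamma"](BOT, 0))       # False
```

The three soundness properties of the interface carry over directly: top is unchanged, and the cases involving BOT only ever add the empty set to a union or an inclusion.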

Finite Reduced Map. Lifted domains are used for instance as input for an
important functor of finite maps. CompCert makes intensive use of the TREE
interface. Given an implementation T of the interface TREE and a type A, an
element of type (T.t A) represents a partial map from keys (of type T.elt) to
values of type A. The interface is implemented for several kinds of keys in the
CompCert libraries. In our development, we use it to map variables to abstract
values, but also program points to abstract environments. The functor has the
following type.

AbTree.make (T:TREE) (A B:Type) : adom A B → adom (T.t A)+⊥ (T.elt → B)

An element in (T.t A)+⊥ is turned into a function of type T.elt → A+⊥ via the
function get that satisfies the following equations.



get(Bot)(k)      = Bot
get(NotBot m)(k) = top             (* if m[k] is undefined *)
get(NotBot m)(k) = NotBot m[k]     (* otherwise *)

As a consequence, the top element is represented in a lazy way: a key is mapped
to top whenever it is not bound in the partial map. Furthermore, the map is
reduced w.r.t. the bottom element of the input domain: as soon as we try to bind
a key to the bottom element, the whole map is shrunk to Bot. This reduction is
useful for dead code elimination and more generally for the overall precision
of an analysis.
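
The behaviour of get and of the reduction can be mimicked with a short Python sketch. The encoding is hypothetical: a map is either the sentinel BOT or a dictionary, abstract values are (lo, hi) pairs, and bind is an illustrative name for the update operation, which the paper does not show.

```python
BOT = object()                      # sentinel for the bottom element
TOP = (-2**31, 2**31 - 1)           # top of the toy value domain


def get(m, k):
    """Lookup following the three equations given for get."""
    if m is BOT:
        return BOT
    return m.get(k, TOP)            # unbound keys are implicitly top


def bind(m, k, v):
    """Bind k to v; binding bottom collapses the whole map to BOT."""
    if m is BOT or v is BOT:
        return BOT
    updated = dict(m)               # keep the input map unchanged
    updated[k] = v
    return updated


env = bind({}, "x", (0, 5))
print(get(env, "x"))                # (0, 5)
print(get(env, "y") == TOP)         # True: y was never bound
dead = bind(env, "y", BOT)          # y has no possible value here
print(dead is BOT)                  # True: the whole environment is bottom
```

In an analysis, an environment equal to BOT at a program point means that no concrete state can reach that point, which is how the reduction exposes dead code.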



4 Fixpoint Resolution 

From a proof point of view, the main lesson learned from the CompCert experi- 
ment is the following. When formally verifying a complex piece of software relying 
on sophisticated data structures and delicate algorithms, it is not realistic to 
write the whole software using exclusively the specification language of the proof 
assistant. A more pragmatic approach to formal verification consists in reusing an 
existing implementation in order to separately verify its results. This approach is 
not optimal, but it is worthwhile when the algorithm is a sophisticated piece of 
code and when the formal verification of each of the results is much easier than 
the formal verification of the algorithm itself. 

The CompCert compiler combines both approaches in order to facilitate the 
proofs. Most of the compiler passes are written and proved in Coq. A few compiler 
passes (e.g., the register allocation [26]) are not written directly in Coq, but 
formally verified in Coq by a translation validation approach. Our value analysis
also combines both approaches. We have formally verified a checker that validates
a posteriori the untrusted results of a fixpoint engine written in OCaml, which
finds fixpoints using widening and narrowing operators.

Like many data flow analyses, our value analysis can be turned into the fixpoint
resolution of an equation system on a lattice. CompCert already provides a
classical Kildall iteration framework to iteratively find the least fixpoint of an
equation system. But using such a framework is impossible here for two reasons.
First, the lattice of bounded intervals contains very long ascending chains that
make standard Kleene iterations too slow. Second, the non-monotonic nature of
widening and narrowing makes fixpoint iteration sensitive to the iteration order
of each equation.

We have therefore designed a new fixpoint resolution framework that relies on the
general iteration techniques defined by Bourdoncle [6]. First, Bourdoncle provides 
a strategy computation algorithm based on Tarjan's algorithm to compute 
strongly connected subcomponents of a directed graph and find loop headers 
for widening positioning. This algorithm also orders each strongly connected 
subcomponent in order to obtain an iteration strategy that iterates inner loops 
until stabilization before iterating outer loops. Bourdoncle then provides an 
efficient fixpoint iteration algorithm that iterates along the previous strategy and 
requires a minimum number of abstract order tests to detect convergence. 



This algorithm relies on advanced reasoning in graph theory and formally
verifying it would be highly challenging. Such a direct approach would also certainly
be too rigid because widening iteration requires several heuristic adjustments
to reach a satisfactory precision in practice (loop unrolling, delayed widenings,
decreasing iterations). We have therefore opted for a more flexible verification
strategy: like Bourdoncle strategies, fixpoints are computed by an external tool
(represented by the function called get_extern_fixpoint) and we only formally
verify a fixpoint checker (called check_fxp).

Our fixpoint analyzer is defined below. It takes an abstract domain ab, a
program P and its entry point entry, the transfer functions transfer and the
initial abstract value init.

Definition solve_pfp (ab: adom t B) (P: PTree.t instruction)
    (entry: node) (transfer: node→instruction→list (node*(t→t)))
    (init: t) : node → t :=
  let fxp := get_extern_fixpoint entry ab P transfer init in
  if check_fxp entry ab P transfer init fxp then fxp else top.

The verification of the fixpoint checker yields the following property: the
concretization of the result of the solve_pfp function is a post-fixpoint of the
concrete transfer function. That is, given the analysis result fxp, for each node pc
of the program and each successor pc' of pc, applying the corresponding transfer
function tf to the analysis result at pc yields an abstract value whose
concretization is included in that of the analysis result at pc'.

Lemma solve_pfp_postfixpoint : ∀ ab entry P transfer init fxp,
  fxp = solve_pfp ab P entry transfer init →
  ∀ pc i, P[pc] = i →
  ∀ (pc',tf) ∈list (transfer pc i), γ(tf(fxp pc)) ⊆ γ(fxp pc').
Proof. (* proof commands are omitted here *) Qed.
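
The division of labour between an untrusted solver and a checked result can be sketched in Python on a toy loop. This is an illustration under several assumptions, not the verified development: abstract values are intervals over unbounded integers (wrap-around is ignored), None stands for bottom, the solver is a naive round-robin iteration without widening, and the body of check_fxp is a guess derived from the post-fixpoint lemma, since the paper does not show it.

```python
INF = float("inf")
TOP, BOT = (-INF, INF), None


def le(a, b):
    if a is BOT:
        return True
    if b is BOT:
        return False
    return b[0] <= a[0] and a[1] <= b[1]


def join(a, b):
    if a is BOT:
        return b
    if b is BOT:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    if a is BOT or b is BOT:
        return BOT
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else BOT


def lift(f):
    """Make a transfer function strict: bottom in, bottom out."""
    return lambda v: BOT if v is BOT else f(v)


# node -> list of (successor, abstract transfer function on the range of x)
TRANSFER = {
    0: [(1, lift(lambda v: (0, 0)))],                 # x = 0
    1: [(2, lambda v: meet(v, (-INF, 9))),            # assume x < 10
        (3, lambda v: meet(v, (10, INF)))],           # assume x >= 10
    2: [(1, lift(lambda v: (v[0] + 1, v[1] + 1)))],   # x = x + 1
    3: [],
}


def check_fxp(entry, transfer, init, fxp):
    """Accept fxp only if it covers init and is stable under every edge."""
    if not le(init, fxp[entry]):
        return False
    return all(le(tf(fxp[pc]), fxp[succ])
               for pc, edges in transfer.items() for succ, tf in edges)


def untrusted_solver(entry, transfer, init):
    """Naive iteration; terminates here because the loop guard bounds x."""
    fxp = {pc: BOT for pc in transfer}
    fxp[entry] = init
    changed = True
    while changed:
        changed = False
        for pc, edges in transfer.items():
            for succ, tf in edges:
                new = join(fxp[succ], tf(fxp[pc]))
                if new != fxp[succ]:
                    fxp[succ], changed = new, True
    return fxp


def solve_pfp(entry, transfer, init, solver):
    """Trust the solver's answer only when the checker accepts it."""
    fxp = solver(entry, transfer, init)
    if check_fxp(entry, transfer, init, fxp):
        return fxp
    return {pc: TOP for pc in transfer}


good = solve_pfp(0, TRANSFER, TOP, untrusted_solver)
buggy = lambda e, t, i: {0: TOP, 1: (0, 5), 2: (0, 5), 3: BOT}
bad = solve_pfp(0, TRANSFER, TOP, buggy)
print(good[1], good[3])   # (0, 10) (10, 10)
print(bad[1] == TOP)      # True: the unsound candidate was rejected
```

Only check_fxp and the final fallback need to be trusted in this scheme, so the solver can freely use heuristics such as widening, narrowing or loop unrolling without affecting soundness.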



5 Numerical Abstraction 

Following the design of the Astrée analyzer [ ], our value analysis is
parameterized by a numerical abstract domain that is unaware of the C memory model.
We first present the interface of abstract numerical environments, then how we
abstract numerical values in order to build non-relational abstract environments.
Finally, we show concrete instances of numerical domains and how they can be
combined.

5.1 Abstraction of Numerical Environments 

The first interface captures the notion of numerical environment abstraction.
Given a type t for abstract values and a notion of variable var (simple positive
integers in our development), we require an abstract domain that concretizes to
℘(var → int) and provides three sound operators range, assign and assume.

sign_flag ::= Signed | Unsigned

Definition ints_in_range (r: sign_flag → itv+⊥) : ℘(int) :=
  (γ (r Signed)) ∩ (γ (r Unsigned)).
Record int_dom (t:Type) := {
  int_adom: adom t (var → int);     (* abstract domain structure *)
  (* signed/unsigned range of an expression *)
  range: nexpr → t → sign_flag → itv+⊥;
  range_sound: ∀ e ρ ab,
    ρ ∈ γ ab → eval_nexpr ρ e ⊆ ints_in_range (range e ab);
  (* assignment of a variable by a numerical expression *)
  assign: var → nexpr → t → t;
  assign_sound: ∀ x e ρ n ab,
    ρ ∈ γ ab → n ∈ eval_nexpr ρ e → (upd ρ x n) ∈ γ (assign x e ab);
  (* assume a numerical expression evaluates to true *)
  assume: nexpr → t → t;
  assume_sound: ∀ e ρ ab,
    ρ ∈ γ ab → Ntrue ∈ eval_nexpr ρ e → ρ ∈ γ (assume e ab)
}.

This interface matches any implementation of a relational abstract
domain [12] on machine integers. To increase precision, it relies on a notion of
expression tree (type nexpr), defined as follows and relying on CFG operators.

e_tr ::= NEvar id | NEconst c | NEunop op1 e_tr | NEbop op2 e_tr e_tr | NEcond e_tr e_tr e_tr

These expressions are associated with a big-step operational semantics eval_nexpr
of type (var → int) → nexpr → ℘(int) that we define as a partial function
represented by a relation. The semantics is not detailed in this paper.



5.2 Building Non-relational Abstraction of Numerical Environments 

Implementing a fully verified relational abstract domain is a challenge in itself and
it is not in the scope of this paper. We implement instead the previous interface
with a standard non-relational abstract environment of the form var → V♯ where
V♯ abstracts numerical values. The notion of abstraction of numerical values is
captured by the following interface.

Record num_dom (t:Type) := {
  num_adom: adom t int; (* abstract domain structure *)
  meet: t → t → t+⊥; (* over-approximation of the concrete *)
  meet_sound: ∀ x y, (γ x) ∩ (γ y) ⊆ γ (meet x y); (* intersection *)
  range: t → sign_flag → itv+⊥; (* signed/unsigned range *)
  range_sound: ∀ x:t, γ x ⊆ ints_in_range (range x);
  const: constant → t; const_sound: (* omitted *);
  forward_unop: unary_operation → t → t+⊥;
  forward_unop_sound: ∀ op x,
    Eval_unop op (γ x) ⊆ γ (forward_unop op x);
  forward_binop: (* omitted *); forward_binop_sound: (* omitted *);
  backward_unop: (* omitted *); backward_unop_sound: (* omitted *);
  backward_binop: binary_operation → t → t → t → t+⊥ * t+⊥;
  backward_binop_sound: ∀ op x y z i j k,
    eval_binop op i j k → i ∈ (γ x) → j ∈ (γ y) → k ∈ (γ z) →
    let (x',y') := backward_binop op x y z in
    i ∈ (γ x') ∧ j ∈ (γ y')
}.

It is defined as a carrier t, an abstract domain structure num_adom and a collection of
abstract transformers. Some operators are forward ones: they provide properties
about the output of an operation. For instance, the operator const builds an
abstraction of a single value. Some operators are backward ones: given some
properties about the input and expected output of an operation, they provide a
refined property about its input. Each operator comes with a soundness proof.
We also implement a functor that lifts any abstraction of numerical values into
a numerical environment abstraction. It relies on the functor for finite reduced
maps that we have presented at the end of Section 3. Here, PTree provides an
implementation of the TREE interface for the var type.

NonRelDom.make (t) : num_dom t → int_dom ((PTree.t t)+⊥)

The most advanced operator in this functor is the assume function. It relies 
on a backward abstract semantics of expressions. 

Fixpoint backward_expr (e:nexpr) (ab:t) (itv:Val) : t :=
  match e with
  | ...
  | NEcond b l r =>
    join
      (backward_expr b (backward_expr r ab itv) (const Nfalse))
      (backward_expr b (backward_expr l ab itv)
        (backward_unop boolval (eval_expr b ab) (const Ntrue)))
  end.

We just show and comment the case of conditional expressions. Given such an
expression NEcond b l r, an abstract environment ab and the expected value itv
of this expression, we explore the two branches of the condition. In one case, the
condition b evaluated to Nfalse and the right branch r evaluated to itv. In the
other case, the condition b evaluated to anything whose boolean value is Ntrue
and the left branch l evaluated to itv. Then we have to consider that any of the
two branches might have been taken, hence the join.

Equipped with such backward operators, the analysis is then able to deal with
complex conditions like the following: if (0 <= x && x < y && y < z && z < t
&& t < u && u < v && v < 10). When analysing the true branch of this if, it is
sound to assume that the condition holds. The backward operator will propagate
this information and infer one bound for each variable. Since backward evaluation
of conditions goes right to left, the following bounds are inferred: v < 10, u < 9,
t < 8, z < 7, y < 6, and 0 ≤ x < 5. Unfortunately, no information is propagated
from left to right. However, applying the assume function again does propagate
information between the various conditions. Iterating this process finally yields
the most precise intervals for all variables involved in this condition.
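The iteration described above can be reproduced on a small model. The Python sketch below is an assumption-laden illustration, not our Coq code: it only handles conjunctions of strict comparisons, encodes 0 <= x as -1 < x, represents intervals as pairs with None for the empty interval, and uses a refinement for "<" written in the style of the backward_lt operator of Section 5.3. The names one_pass, iterate, CONDS and START are hypothetical.

```python
MIN, MAX = -2**31, 2**31 - 1
TOP = (MIN, MAX)

def meet(a, b):
    """Intersection of two intervals; None encodes the empty interval."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def reduce_itv(lo, hi):
    """Single representative for empty intervals."""
    return (lo, hi) if lo <= hi else None

def backward_lt(i, j):
    """Refine i and j under the assumption i < j."""
    if i is None or j is None:
        return None, None
    return (meet(i, reduce_itv(MIN, j[1] - 1)),
            meet(j, reduce_itv(i[0] + 1, MAX)))

def get(env, a):
    """Interval of an operand: a variable name or an integer constant."""
    return env[a] if isinstance(a, str) else (a, a)

def one_pass(env, conds):
    """Assume every comparison a < b once, from the rightmost to the leftmost."""
    env = dict(env)
    for a, b in reversed(conds):
        ia, ib = backward_lt(get(env, a), get(env, b))
        if isinstance(a, str):
            env[a] = ia
        if isinstance(b, str):
            env[b] = ib
    return env

def iterate(env, conds):
    """Repeat one_pass until the environment is stable."""
    while True:
        new = one_pass(env, conds)
        if new == env:
            return env
        env = new

# 0 <= x && x < y && y < z && z < t && t < u && u < v && v < 10
CONDS = [(-1, "x"), ("x", "y"), ("y", "z"), ("z", "t"),
         ("t", "u"), ("u", "v"), ("v", 10)]
START = {name: TOP for name in "xyztuv"}
```

A single call to one_pass(START, CONDS) yields the upper bounds listed above and the full interval [0, 4] for x only; in this sketch the other lower bounds remain close to the minimal integer. Calling iterate(START, CONDS) stabilizes on x in [0, 4], y in [1, 5], z in [2, 6], t in [3, 7], u in [4, 8] and v in [5, 9].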

Notice that inferring such precise information is possible thanks to the
availability of complex expressions in the analyzed CFG program. Compare for
example with Frama-C which, prior to any analysis, destructs boolean operations
into nested ifs; it is thus unable to give both bounds for each variable.

5.3 Abstraction of Numerical Values: Instances and Functor 

We give two instances of the numerical value abstraction interface: the intervals
of signed integers and the intervals of unsigned integers. Several operations are
defined on intervals together with their proofs of correctness. We have to take
into account machine arithmetic. We do not try to precisely track integers that
wrap around intentionally. Instead we systematically test whether an overflow may
occur and fall back to top when we cannot prove the absence of overflow.



Definition repr (i: itv) : itv := if leb i top then i else top.
Definition add (i j: itv) : itv :=
  repr { min := i.min + j.min; max := i.max + j.max }.
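The overflow test performed by repr can be checked on small bit widths. The following Python sketch mirrors the two definitions above under a few assumptions of ours: intervals are pairs of unbounded Python integers, the bit width is a parameter (the development targets 32-bit integers), and wrap is a hypothetical helper modelling two's-complement wrap-around of the concrete addition.

```python
def bounds(bits=32):
    """Smallest and largest signed integers on the given number of bits."""
    return -2**(bits - 1), 2**(bits - 1) - 1

def leb(i, j):
    """Interval inclusion: i is contained in j."""
    return j[0] <= i[0] and i[1] <= j[1]

def repr_itv(i, bits=32):
    """Keep i if it fits in the machine range, otherwise fall back to top."""
    top = bounds(bits)
    return i if leb(i, top) else top

def add(i, j, bits=32):
    """Interval addition computed on unbounded integers, then normalized."""
    return repr_itv((i[0] + j[0], i[1] + j[1]), bits)

def wrap(n, bits=32):
    """Concrete machine result: n reduced modulo 2**bits into the signed range."""
    lo, _ = bounds(bits)
    return (n - lo) % 2**bits + lo
```

For instance, add((1, 2), (3, 4)) stays precise, while adding (1, 1) to the singleton interval containing the largest signed integer may overflow and therefore returns top. Soundness here means that wrap(a + b) always lies in add(i, j) whenever a is in i and b is in j.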

We also rely on a reduction operator when the result of an operation may
lead to an empty interval. Since our representation of intervals contains several
elements with the same (empty) concretization, it is important to always use the
same representative for them.^

Definition reduce (min max: Z) : itv+⊥ :=
  if min ≤ max then NotBot (ITV min max) else Bot.

Definition backward_lt (i j: itv) : itv+⊥ * itv+⊥ :=
  (meet i (reduce min_signed (j.max-1)),
   meet j (reduce (i.min+1) max_signed)).

At run-time, there are no signed or unsigned integers; there are only machine 
integers that are bit arrays whose interpretation may vary depending on the 
operations they undergo. Therefore choosing one of the two interval domains may 
hamper the precision of the analysis. Consider the following example C program. 

1 int main (void) { signed s; unsigned u;
2 if (*) u = 2^31 - 1; else u = 2^31;
3 if (*) s = 0; else s = -1;
4 return u + s; }

At the end of line 2, an unsigned interval can exactly represent the two values 
that the variable u may hold. However, the least signed interval that contains 
them both is top. Similarly, at the end of line 3, a signed interval can precisely 
approximate the content of variable s whereas an unsigned interval would be 
extremely imprecise. Moreover, comparison operations can be precisely translated 
into operations over intervals (e.g., intersections) only when they share the same 
signedness. Therefore, so as to get as precise information as possible, we need to 
combine the two interval domains. This is done through reduction. 
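The loss of precision can be observed by computing, for each variable of the example, the least interval covering its possible bit patterns under both interpretations. The Python sketch below is purely illustrative; it assumes 32-bit integers, and the helper names to_signed, to_unsigned and hull are ours.

```python
def to_signed(n, bits=32):
    """Read the bit pattern n (0 <= n < 2**bits) as a two's-complement integer."""
    return n - 2**bits if n >= 2**(bits - 1) else n

def to_unsigned(n, bits=32):
    """Bit pattern of the integer n, read as an unsigned integer."""
    return n % 2**bits

def hull(values):
    """Least interval containing all the given values."""
    values = list(values)
    return (min(values), max(values))

# Line 2 of the example: u holds one of two adjacent unsigned values.
u_patterns = [2**31 - 1, 2**31]
# Line 3 of the example: s holds 0 or -1.
s_patterns = [to_unsigned(0), to_unsigned(-1)]

u_unsigned = hull(u_patterns)                        # two-element interval
u_signed = hull(to_signed(n) for n in u_patterns)    # whole signed range
s_signed = hull(to_signed(n) for n in s_patterns)    # two-element interval
s_unsigned = hull(s_patterns)                        # whole unsigned range
```

Each variable is precise in exactly one of the two domains, which is the motivation for keeping both views of every value.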

To combine abstractions of numerical values in a generic and precise way, we 
implement a functor that takes two abstractions and a sound reduction operator 
and returns a new abstraction based on their reduced product. 

Definition reduced_product (t t':Type) (N:num_dom t) (N':num_dom t')
  (R: reduction N N') : num_dom (t*t') := (* omitted definition *)

A reduction is made of an operator ρ and a proof that this operator is a sound
reduction.

Record reduction (A B:Type) (N1:num_dom A) (N2:num_dom B) := {
  ρ: A+⊥ → B+⊥ → (A * B)+⊥;
  ρ_sound: ∀ a b, (γ a) ∩ (γ b) ⊆ γ (ρ a b) }.



^ Otherwise the analyzer may encounter two equivalent values without noticing it and 
lose precision. 



Each operator of this functor is implemented by first using the operator of both
input domains and then reducing the result with ρ. We hence ensure that each
encountered value is systematically of the form ρ a b but we do not prove this
fact formally, avoiding the heavy manipulation of quotients. Note also that, for
soundness purposes, we do not need to prove that reduction actually reduces
(i.e., returns a lower element in the abstract lattice)!
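To make the role of ρ concrete, here is one possible reduction between a signed and an unsigned interval, written as a Python sketch. It is an assumption of ours and not necessarily the reduction used in the development: the two views exchange information only when one of them lies entirely in the non-negative or entirely in the negative half of the signed range, where both interpretations differ by a known constant. The names rho and product_meet are hypothetical, and None encodes bottom.

```python
def meet(a, b):
    """Intersection of two intervals; None encodes bottom."""
    if a is None or b is None:
        return None
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def rho(s, u, bits=32):
    """Reduce a signed interval s and an unsigned interval u describing the same
    machine integer. Returns a pair (s', u') or None when no value satisfies both."""
    half, full = 2**(bits - 1), 2**bits
    if s is None or u is None:
        return None
    # Signed view informs the unsigned view.
    if s[0] >= 0:                       # non-negative: both readings coincide
        u = meet(u, s)
    elif s[1] < 0:                      # negative: unsigned = signed + 2**bits
        u = meet(u, (s[0] + full, s[1] + full))
    if u is None:
        return None
    # Unsigned view informs the signed view.
    if u[1] < half:                     # small values: both readings coincide
        s = meet(s, u)
    elif u[0] >= half:                  # large values: signed = unsigned - 2**bits
        s = meet(s, (u[0] - full, u[1] - full))
    if s is None:
        return None
    return (s, u)

def product_meet(p, q, bits=32):
    """Operator of the product: apply meet componentwise, then reduce."""
    (s1, u1), (s2, u2) = p, q
    return rho(meet(s1, s2), meet(u1, u2), bits)
```

For example, rho((0, 100), (50, 2**32 - 1)) tightens both components to (50, 100), and rho((1, 5), (10, 20)) detects an empty intersection. As stated above, soundness only requires that no concrete value is lost; this sketch makes no claim of computing the best possible reduction.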

6 Memory Abstraction 

The last layer of our modular architecture connects the CFG abstract interpreter 
with numerical abstract domains. It aims at translating every C feature into 
useful information in the numerical world. On the interpreter side, the interface 
with this abstract memory model is called mem_dom. It consists in trees made of 
CFG expressions and four basic commands forget, assign, store and assume. 

Record mem_dom (t:Type) := { (* abstract domain with concretization
                                to local environment and global memory *)
  mem_adom: adom t (env * mem);
  (* consult the range of a local variable *)
  range: t → ident → sign_flag → itv+⊥;
  range_sound: ∀ ab e m x i,
    (e,m) ∈ γ ab → (e[x] = vint(i) ∨ ∃ b, e[x] = vptr(b,i)) →
    i ∈ (ints_in_range (range ab x));
  (* project the value of a local variable *)
  forget: ident → t → t;
  forget_sound: ∀ x ab, Forget x (γ ab) ⊆ γ (forget x ab);
  (* assign a local variable *)
  assign: ident → expr → t → t;
  assign_sound: ∀ x e ab, Assign x e (γ ab) ⊆ γ (assign x e ab);
  (* assign a memory cell *)
  store: memory_chunk → expr → expr → t → t;
  store_sound: ∀ κ l r ab,
    Store κ l r (γ ab) ⊆ γ (store κ l r ab);
  (* assume an expression evaluates to non-zero value *)
  assume: expr → t → t;
  assume_sound: ∀ e ab, Assume e (γ ab) ⊆ γ (assume e ab)
}.

Our final analyzer is parameterized by a structure of this type. 

value_analysis (t:Type) : mem_dom t →
  program → node → (ident → sign_flag → Interval.itv+⊥)

A structure of type mem_dom is built with a functor of the following form.

AbMem.make (t:Type) : int_dom t → mem_dom (t*type_info)

The numerical abstraction is associated with a flow-sensitive type information
(of type type_info) that we compute at the same time. This type information
tries to recover some information to disambiguate integer and pointer values. The
abstract domain is built using the product functor presented in Section 3. The
concretization function of the numeric domain is lifted from a concretization of
type t → ℘(var → int) to a concretization of type t → ℘(env * mem) with the
following definition. ^



The types env and mem are introduced in Section 2. 



Definition gamma_mem (ab:t) := λ (e,m): (env*mem).
  ∃ ρ: var → int, ρ ∈ (γ ab) ∧
  (∀ x i, (e[x] = vint(i) ∨ ∃ b, e[x] = vptr(b,i)) → ρ x = i).

For each transfer function that takes as argument a C expression, we convert 
it into a numerical expression in order to feed the numerical abstract domain. 
For instance, the assign operator takes the following form. 

Definition assign (id:ident) (e:expr) (ab: t*type_info) : t*type_info :=
  let (nm,tp) := ab in
  (* convert expression e into a numeric form using type infos *)
  match convert tp e with
  | None => forget id ab (* if we fail, we just project *)
  | Some ne =>
    (* otherwise we call the numerical assignment operator *)
    (num.assign id ne nm, ... (* type info update omitted *))
  end.
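The convert-or-forget pattern of assign can be mimicked on a toy expression language. The Python sketch below is a strong simplification and every name in it is hypothetical: expressions are nested tuples, the numerical domain is a dictionary of intervals, the type information maps each variable to "int", "ptr" or "unknown", and conversion simply fails as soon as a variable is not known to hold an integer. In particular, the type update shown here is our own guess, since the actual update is omitted above.

```python
TOP = (-2**31, 2**31 - 1)

def convert(types, e):
    """Return e as a numerical expression, or None if a variable may hold a pointer."""
    kind = e[0]
    if kind == "const":
        return e
    if kind == "var":
        return e if types.get(e[1]) == "int" else None
    if kind == "add":
        l, r = convert(types, e[1]), convert(types, e[2])
        return None if l is None or r is None else ("add", l, r)
    return None

def eval_itv(nm, ne):
    """Interval evaluation of a converted expression, with fallback to top on overflow."""
    kind = ne[0]
    if kind == "const":
        return (ne[1], ne[1])
    if kind == "var":
        return nm.get(ne[1], TOP)
    (a, b), (c, d) = eval_itv(nm, ne[1]), eval_itv(nm, ne[2])
    lo, hi = a + c, b + d
    return (lo, hi) if TOP[0] <= lo and hi <= TOP[1] else TOP

def assign(x, e, ab):
    """Assign e to x in the abstract state ab = (numerical part, type part)."""
    nm, tp = ab
    ne = convert(tp, e)
    if ne is None:
        # Conversion failed: project x away (assumed type update: unknown).
        return ({**nm, x: TOP}, {**tp, x: "unknown"})
    # Conversion succeeded: delegate to the numerical domain.
    return ({**nm, x: eval_itv(nm, ne)}, {**tp, x: "int"})
```

With x known to be an integer in [0, 4], assigning x + 1 to y gives y the interval [1, 5]. If x is instead recorded as a pointer, the same assignment only forgets y.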

Removing some ambiguity between pointers and integers is mandatory for 
soundness. As an example, consider the unsigned equality expression (x ==u y). 
For the sake of precision of the analysis, it is important to convert it into a simple 
numerical equality x == y before using the assume operator of the numerical 
abstract domain. However if x contains a numerical value and y a pointer, the 
first formula is always false while assuming the second formula in the numerical 
world would lead to a spurious assumption about the offset of the pointer in y. 



7 Experimental Evaluation 

Our verified value analyzer takes as input a CFG program and outputs ranges for 
every variable at every point of the program. Our formal development adds about 
7,500 lines of Coq code (consisting of 4,000 lines of Coq functions and definitions 
and 3,500 lines of Coq statements and proof scripts) and 200 lines of OCaml to 
the 100,000 lines of Coq and 1,000 lines of OCaml provided in CompCert 1.11. 

We have conducted some experiments to evaluate the precision and the 
efficiency of our analyzer. Indeed, an analyzer that always returns "top" is easily 
proved correct, but useless. It is therefore important to distinguish between 
bounded and unbounded variables. Moreover, a precise but non-scalable analyzer 
has limited applicability. In order to evaluate the precision and efficiency of our 
value analysis, we use the OCaml extracted code to compile our benchmark 
programs into CFG programs and to run our analyzer on them. 

We compare our analyzer to two interval-based analyzers operating over C
programs: a state-of-the-art industrial tool, Frama-C [13], and an implementation
of a value-range analyzer [ ]. Frama-C is an industrial-strength framework for
static analysis, developed at CEA. It integrates an abstract interpretation-based
interprocedural value analysis on interval domains with congruence, k-sets and
memory analysis. It operates over C programs and has a very deep knowledge of
the C semantics, allowing it to slice out undefined behaviors for more precise results.
It currently does not handle recursive functions. The value-range analyzer, which
will be referred to as Wrapped, is described in [24]. It relies on LLVM and operates
over its intermediate representation to perform an interval analysis in a
signedness-agnostic manner, using so-called "wrapped" intervals to deal with machine integer
issues such as overflows while retaining precision. It is an intraprocedural tool,
but can benefit from LLVM's inlining to perform interprocedurally in the absence
of recursion.

The three tools have been compared on significant C programs from CompCert's
test suite. They range from a few dozen to a few thousand statements. To
relate information from different analyses, we annotated the programs to capture
information on integer variables at function entries and exits and at loops
(for iteration variables). This amounts to 545 annotations in the 20 programs
considered. For each program point, we counted the number of bounded variables.
We consider as bounded any variable whose inferred interval has no more than 2^31
elements, and hence rule out useless intervals like x ∈ [-2^31, 2^31 - 2], inferred after
a guard like x < y. Finally, to be able to compare the results of an interprocedural
analysis with those of two intraprocedural analyses with inlining, we considered
for each annotation the union of the intervals of all call contexts. Less than
10% of intervals present a union of different intervals, and among those several
preserve the boundedness for all contexts. Overall, the impact of this union on the
results is negligible.
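The counting criterion can be stated as a few lines of Python. This sketch is ours: it assumes non-empty intervals given as pairs, a threshold of 2^31 elements, and an interval union over call contexts computed as a convex hull; the names size, is_bounded and union are hypothetical.

```python
def size(itv):
    """Number of integers in a non-empty interval."""
    return itv[1] - itv[0] + 1

def is_bounded(itv, limit=2**31):
    """A variable counts as bounded when its interval has at most `limit` elements."""
    return size(itv) <= limit

def union(itvs):
    """Least interval containing the intervals of all call contexts."""
    itvs = list(itvs)
    return (min(lo for lo, _ in itvs), max(hi for _, hi in itvs))
```

Under this criterion the interval [-2^31, 2^31 - 2] obtained after a guard such as x < y is not counted, whereas [0, 2^31 - 1] is. A variable bounded by [0, 3] in one call context and [10, 12] in another is counted once, with the interval [0, 12].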

The results are shown in Figure 5, which displays the number of bounded
variables per program and per analyzer. In total, Frama-C bounded 398 variables,
our analyzer got 355, and Wrapped ended up with 305. The main differences
between our analyzer and Frama-C, especially on the larger benchmarks (lzw,
arcode and lzss), result from global variable tracking and congruence information.
Such reasoning is not handled by our analyzer. On the other hand, the precision
of our product of signed and unsigned domains allows us to bound more variables
(e.g., on fannkuch), where Wrapped also obtains a good score, mainly due to
variables bounded as [0, 2^31 - 1] and similar values. Some issues with the inlining
used by Wrapped explain its worse results in fft, knucleotide and spectral.
Fig. 5. Number of bounded intervals (bounded per program and analyzer).



We also compared the execution times of the analyses. Overall, our analysis
runs faster than Frama-C because we track less information, such as pointers
and global variables. For programs without these features, both analyses run in
roughly the same time, from a few tenths of a second for the smaller programs
up to a few seconds for the larger ones. Wrapped's analysis is faster than the
others. On a larger benchmark (over 3,000 lines of C code and about 10,000 CFG
instructions after inlining) our analysis took 34 seconds to perform.

It is hard to draw final conclusions about the precision of our tool from
these experiments. Frama-C, for instance, is likely to perform better on specific
industrial critical software for which it has been specially tuned. Nevertheless
we give evidence that our analyzer performs non-trivial reasoning over the C
semantics, close to that of state-of-the-art (non-verified) tools.

8 Related Work 

While mechanization of research paper proofs attracts an increasing number 
of practitioners, it should not be confused with the activity of developing a 
formally verified compiler or static analyzer. Our work is initially inspired by the 
achievement of the CompCert compiler [ ] but we target the area of abstract- 
interpretation-based static analyzers. 

Previous work on mechanized verification of static analyses has been mostly
based on classic data flow frameworks: Klein and Nipkow instantiate this
framework for inference of Java bytecode types [ ]; Coupet-Grimal and Delobel [ ] and
Bertot et al. [3] for compiler optimizations, and Cachera et al. [7] for control flow
analysis. Vafeiadis et al. [ ] rely on a simple data flow analysis to verify a fence
elimination optimization for concurrent C programs. Compared to these prior
works, our value analysis relies on fixpoint iterations that are accelerated with
widening operators. Cachera and Pichardie [ ] and Nipkow [ ] describe a verified
static analysis based on widenings but their technique is restricted to structured
programs and targets languages without machine arithmetic or pointers. Leroy
and Robert [22] have developed a points-to analysis in the CompCert framework.
This static analysis technique is quite orthogonal to what we formalize here.
Their verified tool is not compared to any existing analyzer. Hofmann et al. [7]
provide a machine-checked correctness proof in Coq for a generic post-fixpoint
solver named RLD. The formalized algorithm is not fully executable and cannot
be extracted to OCaml code.

Of course the area of non-verified static analysis for C programs is a broader
topic. In our context, the most relevant and inspiring works are the static analyses
devoted to a precise handling of signed and unsigned integers [^7,24] and the
Astrée static analyzer [ ]. Our current formalization is directly inspired by
Astrée's design choices, trying to capture some of its key interfaces. Our current
abstract memory model is aligned with the model developed by Miné [23] because
we connect a C abstract semantics with a generic notion of numerical abstract
domain. Still our treatment of memory is simplified since we only track values of
local variables in the current implementation of our analyzer.



9 Conclusion 

This work provides the first verified value analysis for a realistic language such as C.
Implementing a precise value analysis for C is highly error-prone. We hope that
our work shows the feasibility of developing such a tool together with a
machine-checked proof. The precision of the analysis has been experimentally evaluated and
compared on several benchmarks. Our analyzer performs comparably
to existing off-the-shelf (unverified!) tools, Frama-C [ ] and Wrapped [24]. Our
contribution is also methodological. Our formalization, its lightweight interfaces
and its proofs can be easily reused to develop different formally verified analyses.
Now that the main interfaces are defined, we expect to improve our analyzer
in several challenging directions. First, we want to replace the current memory
abstraction with a domain similar to Miné's memory model [23]. Verifying such a
domain raises specific challenges not only in terms of semantic proofs but also in
terms of efficient implementation of the transfer functions. Without special care,
the domain may not be able to scale to large enough programs. We also intend to
connect relational abstract domains to the interface for numerical environments.
We would like to develop efficient validation techniques following the approach of
Besson et al. [ ] and test their efficiency on large programs. The last and important
challenge concerns floats. Astrée relies on subtle reasoning and manipulation on
floats. CompCert has recently been enhanced with a fully verified implementation
of floating-point arithmetic [ ] and we hope to be able to incorporate it in
our own value analysis.